| Dataset | Before filtering | After CD-HIT | After partitioning (train-test) | After partitioning (independent) |
|---|---|---|---|---|
| N_IM | 59 | 59 | 50 | 6 |
| N_OM | 59 | 56 | 46 | 4 |
| N_TM | 276 | 222 | 192 | 30 |
| N_S | 357 | 340 | 287 | 53 |
| N_TL_SEC | 49 | 43 | 37 | 4 |
| N_TL_TAT | 84 | 89 | 67 | 6 |
| P_IM | 187 | 128 | 106 | 11 |
| P_TM | 4456 | 1237 | 1073 | 156 |
| P_S | 1417 | 419 | 360 | 42 |
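
A quick sanity check on the counts above can be sketched as follows. It assumes that each step can only remove sequences, and that the train-test and independent sets are disjoint subsets of the CD-HIT output; neither assumption is stated in these notes. The function name `find_inconsistent_rows` is made up for illustration.

```python
# Counts copied from the table:
# (before filtering, after CD-HIT, train-test, independent)
counts = {
    "N_IM": (59, 59, 50, 6),
    "N_OM": (59, 56, 46, 4),
    "N_TM": (276, 222, 192, 30),
    "N_S": (357, 340, 287, 53),
    "N_TL_SEC": (49, 43, 37, 4),
    "N_TL_TAT": (84, 89, 67, 6),
    "P_IM": (187, 128, 106, 11),
    "P_TM": (4456, 1237, 1073, 156),
    "P_S": (1417, 419, 360, 42),
}


def find_inconsistent_rows(table):
    """Return dataset names whose counts grow at some filtering step.

    Assumption: filtering and partitioning never add sequences, so
    before >= after CD-HIT >= train-test + independent should hold.
    """
    flagged = []
    for name, (before, after_cdhit, train_test, independent) in table.items():
        partitioned = train_test + independent
        if not (before >= after_cdhit >= partitioned):
            flagged.append(name)
    return flagged


for name, (before, after_cdhit, _, _) in counts.items():
    print(f"{name}: {after_cdhit / before:.0%} retained after CD-HIT")
print("rows to double-check:", find_inconsistent_rows(counts))
```

With the values as printed in the table, the check flags N_TL_TAT, where the count rises from 84 to 89 after CD-HIT, so that row is worth re-checking against the source data.
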

We use the train-test dataset obtained after homology partitioning to perform 5-fold cross-validation (CV) repeated 5 times. Based on the CV results, we want to select the optimal architecture, i.e. the one with the highest mean kappa. We repeat the CV to reduce the variance of the performance measures and to check how stable each architecture is.
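
The repeated CV scheme can be sketched in plain Python as below. This is a minimal illustration, not the pipeline used here: it assumes that "kappa" means Cohen's kappa, the folds are simple shuffled splits (not stratified, and not aware of sequence homology), and the names `cohen_kappa`, `repeated_kfold`, `cv_kappas` and `fit_predict` are hypothetical. The labels and the majority-class baseline are toy placeholders standing in for a real model.

```python
import random
from collections import Counter
from statistics import mean, stdev


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two label sequences."""
    n = len(y_true)
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / n ** 2
    if p_e == 1:
        # Kappa is undefined when only one class occurs; returning 0.0
        # here is a convention chosen for this sketch.
        return 0.0
    return (p_o - p_e) / (1 - p_e)


def repeated_kfold(n_samples, n_folds=5, n_repeats=5, seed=0):
    """Yield (repeat, fold, train_idx, test_idx), reshuffling every repeat."""
    rng = random.Random(seed)
    for repeat in range(n_repeats):
        order = list(range(n_samples))
        rng.shuffle(order)
        for fold in range(n_folds):
            test_idx = order[fold::n_folds]
            test_set = set(test_idx)
            train_idx = [i for i in order if i not in test_set]
            yield repeat, fold, train_idx, test_idx


def cv_kappas(labels, fit_predict, n_folds=5, n_repeats=5, seed=0):
    """Return one kappa per (repeat, fold): n_folds * n_repeats values."""
    kappas = []
    splits = repeated_kfold(len(labels), n_folds, n_repeats, seed)
    for _, _, train_idx, test_idx in splits:
        y_pred = fit_predict(train_idx, test_idx)
        y_true = [labels[i] for i in test_idx]
        kappas.append(cohen_kappa(y_true, y_pred))
    return kappas


# Toy labels and a majority-class baseline in place of a real architecture.
labels = ["A"] * 30 + ["B"] * 20


def majority_baseline(train_idx, test_idx):
    majority = Counter(labels[i] for i in train_idx).most_common(1)[0][0]
    return [majority] * len(test_idx)


kappas = cv_kappas(labels, majority_baseline)
print(f"{len(kappas)} folds, mean kappa {mean(kappas):.3f}, sd {stdev(kappas):.3f}")
```

Each architecture would get its own list of 25 kappa values, whose mean and standard deviation can then be compared.
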
How should we select the best architecture? Should we pick the one with the highest mean kappa regardless of the standard deviations, include the standard deviations in the decision (how?), or use some other method?
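
One possible way to bring the spread into the decision is a "one-standard-error" style rule: take the architecture with the highest mean kappa, then prefer the simplest architecture whose mean lies within one standard error of that best mean. The sketch below is only an illustration of that idea, not a recommendation settled in these notes. The architecture names, kappa values and complexity scores are invented, and `select_architecture` is a hypothetical helper. Note also that folds from repeated CV share training data, so a standard error computed this way is likely too small.

```python
from math import sqrt
from statistics import mean, stdev


def summarize(kappas):
    """Return (mean, standard deviation, naive standard error)."""
    sd = stdev(kappas)
    return mean(kappas), sd, sd / sqrt(len(kappas))


def select_architecture(results, complexity):
    """Pick an architecture with a one-standard-error style rule.

    results:    name -> list of kappa values from repeated CV
    complexity: name -> number, lower meaning simpler (an assumption;
                e.g. a parameter count)
    Returns (best_by_mean, chosen, stats).
    """
    stats = {name: summarize(k) for name, k in results.items()}
    best = max(stats, key=lambda name: stats[name][0])
    best_mean, _, best_se = stats[best]
    threshold = best_mean - best_se
    candidates = [name for name in stats if stats[name][0] >= threshold]
    # Simplest candidate wins; ties are broken by the higher mean kappa.
    chosen = min(candidates, key=lambda name: (complexity[name], -stats[name][0]))
    return best, chosen, stats


# Invented numbers, for illustration only.
results = {
    "arch_small": [0.80, 0.82, 0.78, 0.81, 0.79],
    "arch_large": [0.81, 0.85, 0.77, 0.83, 0.79],
}
complexity = {"arch_small": 1, "arch_large": 2}

best, chosen, stats = select_architecture(results, complexity)
print("highest mean kappa:", best)
print("chosen by one-SE rule:", chosen)
```

In this toy case the larger architecture has the higher mean, but the smaller one falls within one standard error of it and is therefore chosen.
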
Independent dataset: is that enough? We could also test the model on proteins from organisms that do not possess ‘typical’ chloroplasts, such as Paulinella or Plasmodium (data availability is a problem here; we will have to check the literature).
Jackknife: everyone else is using it, but it has its problems (it tends to overestimate model performance and doesn’t work well with GLMs).